Analysis of Relationship Between Sales and Income Data#

In this NoteBook, we will be analyzing the relationship between sales of tobacco and median income in each state in the years 2013, 2014, 2016, 2017, 2018, and 2019. We omitted the year 2015 because the original data source U.S._Chronic_Disease_Indicators__Tobacco.csv lacks the average sales data we need for analysis. We first restructured the average sales and the median income data into our desired data frame and displayed them. We then plot 6 choropleth graphs for average sales in each state in the years of 2013 to 2019 (excluding 2015), and 7 choropleth graphs for median income in each state in the years of 2013 to 2019. We put them into the outputs folder of the project. Finally, we compared the graphs that we generated, analyze them, and drew a conclusion from the graphs.

import pandas as pd
import plotly.express as px
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 1
----> 1 import pandas as pd
      2 import plotly.express as px

ModuleNotFoundError: No module named 'pandas'

Loading Data#

Average Sales Data#

# avg sales data

sales_DATA = pd.read_csv("data/U.S._Chronic_Disease_Indicators__Tobacco.csv", low_memory=False)

# sales for each state in years from 2013 to 2019
# filter question
states_sales = sales_DATA[sales_DATA["Question"] == "Sale of cigarette packs"]
# filter columns
states_sales = states_sales[['YearStart', 'LocationAbbr', 'LocationDesc', 'Question', 'DataValueUnit', 'DataValue']]
states_sales = states_sales.groupby(['YearStart', 'LocationDesc']).sum().reset_index()
# remove states GU, PR, US, VI since they are not actual states.
states_sales = states_sales.loc[states_sales['DataValue'] != 0.0]
states_sales.to_csv("data/states_sales.csv")

# change format of data
sub = states_sales.groupby(['LocationDesc', 'LocationAbbr', 'DataValueUnit']).apply(lambda row: pd.Series({"2013": row.DataValue[row.YearStart == 2013].values[0],
                                                                   "2014": row.DataValue[row.YearStart == 2014].values[0],
                                                                   # ignoring 2015 because missing data in the original source
                                                                   "2016": row.DataValue[row.YearStart == 2016].values[0],
                                                                   "2017": row.DataValue[row.YearStart == 2017].values[0],
                                                                   "2018": row.DataValue[row.YearStart == 2018].values[0],
                                                                    "2019": row.DataValue[row.YearStart == 2019].values[0]}))

sub =sub.reset_index(['LocationDesc', 'LocationAbbr', 'DataValueUnit'], drop=False)
sub.head()
LocationDesc LocationAbbr DataValueUnit 2013 2014 2016 2017 2018 2019
0 Alabama AL pack sales per capita 64.6 61.7 60.1 58.1 54.7 53.1
1 Alaska AK pack sales per capita 39 37.2 35.6 32.9 31.1 30.4
2 Arizona AZ pack sales per capita 24.4 23 23.1 22.5 21.8 20.7
3 Arkansas AR pack sales per capita 57.5 54.4 54.2 52.4 50.2 47.6
4 California CA pack sales per capita 23.9 22.7 22 20.5 16.6 15.8

Median Income Data#

from project_tools.dataformat import income_data_import

Map plot for avg sales per state from 2013 - 2019 (6 graphs; 2015 omitted)#

# avg sales (sales per capita) for each year (6 in total)

# Year 2015 omitted
Years = [2013, 2014, 2016, 2017, 2018, 2019]
year = 0

# plotting 6 graphs in 2013, 2014, 2016, 2017, 2018, and 2019
for index in range(3, 9):
    
    ddf = sub.iloc[:, index].astype(float)
    
    fig =  px.choropleth(sub,
                         locations="LocationAbbr",
                         hover_name="LocationDesc",
                         color= ddf.values,
                         locationmode='USA-states',
                         color_continuous_scale='greens',
                         range_color = (80000, 40000))

    fig.update_layout(title_text = f"Sales per capita in {Years[year]}", geo_scope='usa')

    fig.write_html(f"output/avgSales{Years[year]}.html")
    year += 1
    fig.show()

Map plot for median income from 2013 - 2019 (7 graphs)#

# median income graph for each year (6 in total)
Years1 = [2013, 2014, 2015, 2016, 2017, 2018, 2019]
year = 0

# plotting 7 graphs in 2013 to 2019
for index in reversed(range(1, 8)):
    ddf = median_income.iloc[:, index].astype(float)

    fig = px.choropleth(median_income,
                         locations="state",
                         hover_name="Location",
                         color=ddf.values,
                         locationmode='USA-states',
                         color_continuous_scale='reds',
                         range_color = (80000, 40000))

    fig.update_layout(title_text = f"Median Income in {Years1[year]}", geo_scope='usa')

    fig.write_html(f"output/medianIncome{Years1[year]}.html")
    year += 1
    fig.show()

Analysis#

As we can see from the 6 graphs of average sales, the allocation of which states have the most tobacco pack sales per capita hasn’t changed much, with the state of New Hampshire, Kentucky, West Virginia, and Missouri having relatively the most packs sold in the years between 2013 and 2019 (excluding 2015). However, in the Median income graphs we see that the states with the highest median income are usually New Hampshire, Massachusetts, Connecticut, New Jersey, and Maryland. Just by looking at these data, it seems like there’s no direct relationship between the average amount of pack sold and median income in that state. A higher median income of a state doesn’t necessarily correlate with more packs of tobacco sold, and same with the opposite. We can see that the states with least amount of packs sold (New York, Utah, Washington, and California) are not the ones with the lowest median income (Mississippi, New Mexico, Louisiana, Arkansas, West Virginia, etc). Therefore, we failed to find any significant or meaningful relationship between the sales and income in each state just by visualizing the graphs and understanding which states has higher sales per capita and which states has higher median income. And thus we went one step further and conducted a regression analysis on the two data in the next Notebook.

## put them side by side

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=1, cols=2, subplot_titles=['income', 'sales'],
    specs=[[{"type": "choropleth"}, {"type": "choropleth"}]])


fig_1 = px.choropleth(median_income,
                      locations="state",
                      hover_name="Location", 
                      color="2013 Median income", 
                      locationmode='USA-states',
                      color_continuous_scale='reds',
                      range_color = (80000, 40000))
fig_1.update_layout(title_text = "Median Income in 2013", geo_scope='usa')


fig_2 = px.choropleth(sub, locations="LocationAbbr",
                      hover_name="LocationDesc",
                      color="2013",
                      locationmode='USA-states',
                      color_continuous_scale='greens',
                      range_color = (80000, 40000))
fig_2.update_layout(title_text = "avg sales in 2013", geo_scope='usa')


fig.add_trace(fig_1['data'][0], row=1, col=1)
fig.add_trace(fig_2['data'][0], row=1, col=2)

fig.show()
from IPython.display import HTML
# HTML(filename='output/medianIncome2013.html')
# HTML(filename='output/medianIncome2014.html')

def putHTML(year, avg_sales):
    '''Takes in a year & whether the user wants average sales or median income data, and display the html'''
    if avg_sales:
        return HTML(filename = f'output/avgSales{year}.html')
    else:
        return HTML(filename= f'output/medianIncome{year}.html')


putHTML(year = 2013, avg_sales = True)